QUANTITATIVE COMPARISONS BETWEEN FINITARY POSTERIOR 
DISTRIBUTIONS AND BAYESIAN POSTERIOR DISTRIBUTIONS 



FEDERICO BASSETTI 

Abstract. The main object of Bayesian statistical inference is the determination of pos- 
terior distributions. Sometimes these laws are given for quantities devoid of empirical 
value. This serious drawback vanishes when one confines oneself to considering a finite 
horizon framework. However, assuming infinite exchangeability gives rise to fairly tractable 
a posteriori quantities, which is very attractive in applications. Hence, with a view to a 
reconciliation between these two aspects of the Bayesian way of reasoning, in this paper 
we provide quantitative comparisons between posterior distributions of finitary parameters 
and posterior distributions of allied parameters appearing in usual statistical models. 



1. Introduction 

In the Bayesian reasoning the assumption of infinite exchangeability gives rise to fairly 
tractable a posteriori quantities, which is very attractive in real applications. If observations 
form an infinite exchangeable sequence of random variables, de Finetti's representation the- 
orem states that they are conditionally independent and identically distributed, given some 
random parameter, and the distribution of this random parameter is the center of the current 
Bayesian statistical inference. The theoretical deficiency of this approach lies in interpreting 
these parameters. In fact, as pointed out for the first time by de Finetti (see [?] and also
[?]), parameters ought to be of such a nature that one should be able to acknowledge at least
the theoretical possibility of experimentally verifying whether hypotheses on these
parameters are true or false. A closer look at the usual Bayesian procedures shows that
Bayesian statisticians often draw inferences (from observations) both to empirical (i.e.
verifiable) and to non-empirical hypotheses. To better understand this point, it is worth
stating a more complete formulation of the already mentioned de Finetti representation
theorem: a sequence $(\xi_n)_{n \ge 1}$ of random elements taking values in some suitable
measurable space $(X, \mathcal{X})$ (e.g. a Polish space) is exchangeable if and only if the
empirical distribution

\[ e_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{\xi_i} \]

converges in distribution to a random probability $p$ with probability one and the $\xi_n$'s
turn out to be conditionally independent given $p$, with common distribution $p$. Hence, it is
$p$ that takes the traditional role of parameter in Bayesian modeling. However, since $p$ is a
limiting entity of mathematical nature, hypotheses related to it might be devoid of empirical
value. It is clear that this drawback vanishes when one confines oneself to considering a
finite horizon framework, in which $e_N$, which is always (at least ideally) observable, takes
the place of $p$. In this way one preserves the hypothesis of exchangeability, which is quite
natural in many statistical problems, but one avoids the problem of assessing a probability
law for unobservable entities. In particular, in this context, the conditional distribution of
the empirical measure $e_N$ given $\xi(n) := (\xi_1, \dots, \xi_n)$ ($n < N$) takes the place of
the conditional distribution of $p$ given $\xi(n)$, i.e. the usual posterior distribution of
Bayesian (nonparametric) inference.

Even if, in view of de Finetti's representation, the parameter corresponding to the
so-called "unknown distribution" (i.e. $p$) is the limit, as $N \to +\infty$, of the empirical
distribution, it should be emphasized that in Bayesian practice two conflicting aspects
sometimes occur. On the one hand, statistical inference ought to concern finitary and,
therefore, observable entities whereas, on the other hand, simplifications of a technical
nature can generally be obtained by dealing with (parameters defined as functions of) the
"unknown distribution" $p$. Hence, it is interesting to compare the conditional distribution
of $e_N$ given $\xi(n)$ with the conditional distribution of $p$ given $\xi(n)$, when
$(\xi_k)_{k \ge 1}$ is an infinite exchangeable sequence directed by $p$. This is the aim of
this paper, which can be thought of as a continuation of the papers [?] and [?], where
specific forms of (finitary) exchangeable laws have been defined and studied in terms of
finitary statistical procedures.

The rest of the paper is organized as follows. Section 2 contains a brief overview of
the finitary approach to statistical inference together with some examples. Sections 3 and 4
deal with the problem of quantifying the discrepancy between the conditional law of $e_N$
given $\xi(n)$ and the conditional law of $p$ given $\xi(n)$.

To conclude these introductory remarks it is worth mentioning the work of Diaconis and
Freedman [?] which, to some extent, is connected with our present work. In point of fact,
Diaconis and Freedman provide an optimal bound for the total variation distance between the
law of $(\xi_1, \dots, \xi_n)$ and the law of $(\zeta_1, \dots, \zeta_n)$,
$(\xi_1, \dots, \xi_N)$ being a given finite exchangeable sequence and $(\zeta_k)_{k \ge 1}$ a
suitable infinite exchangeable sequence.

2. Finitary statistical procedures 

As said before, we assume that the process of observation can be represented as an
infinite exchangeable sequence $(\xi_k)_{k \ge 1}$ of random elements defined on a probability
space $(\Omega, \mathcal{F}, P)$ and taking values in a complete separable metric space
$(X, d)$, endowed with its Borel $\sigma$-field $\mathcal{X}$. Let $\mathbb{P}_0$ be a subset
of the set $\mathbb{P}$ of all probability measures on $(X, \mathcal{X})$ and let
$t : \mathbb{P}_0 \to \Theta$ be a parameter of interest, $\Theta$ being a suitable parameter
space endowed with a $\sigma$-field. From a finitary point of view, a statistician must focus
his attention on empirical versions $t(e_N)$ of the more common parameter $t(p)$.

It might be useful, at this stage, to recast the decision theoretic formulation of a
statistical problem in finitary terms. Usually one assumes that the statistician has a set $B$
of decision rules at his disposal and that these rules are defined, for any $n \le N$, as
functions from $X^n$ to some set $A$ of actions. Then one considers a loss function $L$, i.e. a
positive real-valued function on $\Theta \times A$, such that $L(\theta, a)$ represents the
loss when the value of $t(e_N)$ is $\theta$ and the statistician chooses action $a$. It is
supposed that

\[ r_N(\delta(\xi(n))) := E[L(t(e_N), \delta(\xi(n))) \mid \xi(n)] \]

is finite for any $\delta$ in $B$ and, then, $r_N(\delta(\xi(n)))$ is said to be the a
posteriori Bayes risk of $\delta(\xi(n))$. Moreover, a Bayes rule is defined to be any element
$\delta_{FB}$ of $B$ such that

\[ r_N(\delta_{FB}(\xi(n))) = \min_{\delta \in B} r_N(\delta(\xi(n))) \]

for any realization of $\xi(n)$. We shall call such a Bayes rule a finitary Bayes estimator in
order to distinguish it from the more common Bayes estimator obtained by minimizing

\[ r(\delta(\xi(n))) := E[L(t(p), \delta(\xi(n))) \mid \xi(n)]. \]

While the law of the latter estimator is determined by the posterior distribution, that is the
conditional distribution of $p$ given $\xi(n)$, the law of a finitary Bayes estimator is
determined by the "finitary" posterior distribution, that is the conditional distribution of
$t(e_N)$ given $\xi(n)$. A few simple examples will hopefully clarify the connection between
the finitary Bayesian procedures and the usual Bayesian ones. In all the examples we shall
present, observations are assumed to be real-valued, that is
$(X, \mathcal{X}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, the space of actions is some subset
of $\mathbb{R}$ and the loss function is quadratic, i.e. $L(x, y) = |x - y|^2$. It is clear
that, under these hypotheses,

\[ \delta_{FB}(\xi(n)) = E[t(e_N) \mid \xi(n)] \]

and the usual Bayes estimator is given by

\[ E[t(p) \mid \xi(n)]. \]

Example 1 (Estimation of the mean). Suppose the statistician has to estimate the mean
under the squared error loss, i.e. the functional of interest is
$t(p) := \int_{\mathbb{R}} x \, p(dx)$. The usual Bayes estimator is

\[ \hat\mu_n := E[\xi_{n+1} \mid \xi(n)] \]

while the "finitary Bayes" estimator is

\[ \hat\mu_{FB} = \frac{n}{N} \bar\mu_n + \frac{N - n}{N} \hat\mu_n, \]

where

\[ \bar\mu_n = \frac{1}{n} \sum_{i=1}^{n} \xi_i. \]

Note that in this case the finitary Bayes estimator is a convex combination of the usual Bayes
estimator with the empirical (plug-in) estimator $\bar\mu_n$.

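Example 2 (Estimation of the variance). Now, consider the estimation of the variance
$t(p) = \int_{\mathbb{R}} x^2 p(dx) - (\int_{\mathbb{R}} x \, p(dx))^2$, under the squared
error loss. In this case the space of actions is $\mathbb{R}^+$ and the usual Bayes estimator
is

\[ \hat\sigma_n^2 := \hat s_n^2 - \hat c_{1,2,n} \]

where

\[ \hat s_n^2 := E[\xi_{n+1}^2 \mid \xi(n)] \quad \text{and} \quad \hat c_{1,2,n} := E[\xi_{n+1} \xi_{n+2} \mid \xi(n)]. \]

Some computations show that the "finitary Bayes" estimator is

\[ \hat\sigma_{FB}^2 = \frac{n}{N} \bar s_n^2 + \frac{N - n + n/N - 1}{N} \hat s_n^2 - \frac{n^2}{N^2} \bar c_{1,2,n} - \frac{(N - n)(N - n - 1)}{N^2} \hat c_{1,2,n} - \frac{2(N - n) n}{N^2} \bar\mu_n \hat\mu_n, \]

where $\hat\mu_n := E[\xi_{n+1} \mid \xi(n)]$, $\bar\mu_n := \frac{1}{n} \sum_{i=1}^{n} \xi_i$,

\[ \bar s_n^2 = \frac{1}{n} \sum_{i=1}^{n} \xi_i^2 \quad \text{and} \quad \bar c_{1,2,n} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \xi_i \xi_j. \]

Example 3 (Estimation of the distribution function). Assume one has to estimate
$t(p) = F_p(y) = p\{(-\infty, y]\}$, where $y$ is a fixed real number. Under the square loss
function, the classical Bayes estimator is

\[ E(\mathbb{I}_{(-\infty, y]}(\xi_{n+1}) \mid \xi(n)) \]

while the "finitary Bayes" estimator is

\[ \hat F_{FB}(y) = \frac{n}{N} \mathbb{F}_n(y) + \frac{N - n}{N} E(\mathbb{I}_{(-\infty, y]}(\xi_{n+1}) \mid \xi(n)) \]

where $\mathbb{F}_n(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{(-\infty, y]}(\xi_i)$.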

Example 4 (Estimation of the mean difference). Estimate the Gini mean difference

\[ t(p) = \Delta(p) = \int_{\mathbb{R}^2} |x - y| \, p(dx) p(dy) \]

under the squared error loss. The usual Bayes estimator is

\[ E(|\xi_{n+1} - \xi_{n+2}| \mid \xi(n)) \]

while the "finitary Bayes" estimator is

\[ E(\Delta(e_N) \mid \xi(n)) = \frac{n^2}{N^2} \Delta_n + \frac{(N - n - 1)(N - n)}{N^2} E(|\xi_{n+1} - \xi_{n+2}| \mid \xi(n)) + \frac{2(N - n)}{N^2} \sum_{j \le n} E(|\xi_j - \xi_{n+1}| \mid \xi(n)), \]

where $\Delta_n := \Delta(e_n) = \frac{1}{n^2} \sum_{i, j \le n} |\xi_i - \xi_j|$.

It is worth noticing that in all the previous examples, when $N$ goes to $+\infty$ the
"finitary Bayes" estimator converges to the usual Bayes estimator, while the "finitary Bayes"
estimator becomes the usual plug-in frequentist estimator if $n = N$.
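
The two limiting regimes just mentioned can be checked numerically. The following is a minimal sketch for the mean and the variance (Examples 1 and 2); the function names and the idea of passing the predictive moments $E[\xi_{n+1} \mid \xi(n)]$, $E[\xi_{n+1}^2 \mid \xi(n)]$ and $E[\xi_{n+1} \xi_{n+2} \mid \xi(n)]$ as plain numbers are illustrative assumptions, since these quantities depend on the prior, which is left unspecified here.

```python
def finitary_mean(xs, N, pred_mean):
    """Finitary Bayes estimator of the mean: (n/N)*sample mean + ((N-n)/N)*predictive mean."""
    n = len(xs)
    sample_mean = sum(xs) / n
    return (n / N) * sample_mean + ((N - n) / N) * pred_mean


def finitary_variance(xs, N, pred_mean, pred_second, pred_cross):
    """Finitary Bayes estimator of the variance, i.e. E[t(e_N) | xi(n)] for t = variance.

    pred_mean   : E[xi_{n+1} | xi(n)]           (hypothetical input)
    pred_second : E[xi_{n+1}^2 | xi(n)]         (hypothetical input)
    pred_cross  : E[xi_{n+1} xi_{n+2} | xi(n)]  (hypothetical input)
    """
    n = len(xs)
    m = N - n  # number of not yet observed units
    sample_mean = sum(xs) / n
    sample_second = sum(x * x for x in xs) / n
    return ((n / N) * sample_second
            + (m * (N - 1) / N**2) * pred_second
            - (n**2 / N**2) * sample_mean**2
            - (m * (m - 1) / N**2) * pred_cross
            - (2 * m * n / N**2) * sample_mean * pred_mean)
```

With `N = len(xs)` both functions return the plug-in quantities (sample mean and sample variance with divisor $n$), whatever predictive moments are supplied; for very large `N` they approach `pred_mean` and `pred_second - pred_cross` respectively.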

3. Comparison between posterior distributions of means 

Let $Q$ be the probability distribution of $p$. Then $Q$ turns out to be a probability
measure on $\mathbb{P}(X)$. Without loss of generality consider $\mathbb{P} = \mathbb{P}(X)$
endowed with a bounded metric $\lambda$ which induces the weak convergence on $\mathbb{P}$
(e.g. the Prohorov metric), and write $\mathcal{P}$ for its Borel $\sigma$-field. In what
follows, if necessary, expand $(\Omega, \mathcal{F}, P)$ in order to contain all the random
variables needed and, for any random variable $V$, let $\mathcal{L}_V$ designate the
probability distribution of $V$ and, for any other random element $U$, let
$\mathcal{L}_{V|U}$ denote some conditional probability distribution for $V$ given $U$. In
particular, $\mathcal{L}_{e_N|\xi(n)}$ will denote (a version of) the conditional distribution
of $e_N$ given $\xi(n) := (\xi_1, \dots, \xi_n)$ and $\mathcal{L}_{p|\xi(n)}$ will stand for (a
version of) the conditional distribution of $p$ given $\xi(n)$, i.e. the so-called posterior
distribution of $p$. Such distributions exist since $(\mathbb{P}, \lambda)$ is Polish.

As already said, the main goal of this paper is comparing $\mathcal{L}_{e_N|\xi(n)}$ with
$\mathcal{L}_{p|\xi(n)}$. We start by comparing posterior means. Indeed, as we have seen in
the previous section, the posterior mean of a function $f$ appears in many natural statistical
estimation problems. For the sake of notational simplicity, for any measurable real-valued
function $f$, set

\[ e_N(f) := \int_X f(x) \, e_N(dx) = \frac{1}{N} \sum_{i=1}^{N} f(\xi_i) \]

and

\[ p(f) := \int_X f(x) \, p(dx). \]



First of all we prove this very simple

Proposition 3.1. Given a real-valued measurable function $f$, if $P\{p(|f|) < +\infty\} = 1$,
then $e_N(f)$ converges in law to $p(f)$ (as $N \to +\infty$). Analogously,
$\mathcal{L}_{e_N(f)|\xi(n)}$ converges weakly (almost surely) to $\mathcal{L}_{p(f)|\xi(n)}$.

Proof. Let $\phi$ be a bounded continuous function with $c = \|\phi\|_\infty$. Then
$|\phi(e_N(f))| \le c$. Now, $E(\phi(e_N(f)) \mid p)$ converges almost surely to
$E(\phi(p(f)) \mid p) = \phi(p(f))$. To see this, note that, conditionally on $p$, $e_N(f)$ is
an average of independent and identically distributed random variables with mean $p(f)$ and
absolute moment $p(|f|)$ and, since $p(|f|)$ is almost surely finite, the strong law of large
numbers shows that, conditionally on $p$, $e_N(f)$ converges almost surely to $p(f)$, and
hence also in law. Since $|E(\phi(e_N(f)) \mid p)| \le c$ almost surely, to conclude the proof
it is enough to apply the dominated convergence theorem. The second part of the proposition
can be proved in the same way, conditioning with respect to $(p, \xi(n))$.



In order to give a quantitative version of the previous statement we resort to the
so-called Gini-Kantorovich-Wasserstein distance. Let $\mathbb{P}_1 = \mathbb{P}_1(\mathbb{R}^d)$
be the subset of the set $\mathbb{P} = \mathbb{P}(\mathbb{R}^d)$ of all probability measures on
$\mathcal{B}(\mathbb{R}^d)$ defined by
$\mathbb{P}_1 := \{p \in \mathbb{P} : \int_{\mathbb{R}^d} \|x\| p(dx) < +\infty\}$, where
$\| \cdot \|$ denotes the Euclidean norm on $\mathbb{R}^d$. For every couple of probability
measures $(p, q)$ in $\mathbb{P}_1 \times \mathbb{P}_1$ the Gini-Kantorovich-Wasserstein
distance (of order one) between $p$ and $q$ is defined by

\[ w_1(p, q) := \inf \left\{ \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\| \, \gamma(dx \, dy) : \gamma \in \mathcal{M}(p, q) \right\}, \]

$\mathcal{M}(p, q)$ being the class of all probability measures on
$(\mathbb{R}^d \times \mathbb{R}^d, \mathcal{B}(\mathbb{R}^d \times \mathbb{R}^d))$ with
marginal distributions $p$ and $q$. For a general definition of the
Gini-Kantorovich-Wasserstein distance and its properties see, e.g., [30]. If $Z_1$ and $Z_2$
are two random variables with law $p$ and $q$ respectively, $w_1(Z_1, Z_2)$ will stand for
$w_1(p, q)$.

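Proposition 3.2. Given a real-valued measurable function $f$ such that
$E[f(\xi_1)^2] < +\infty$, then

\[ w_1(e_N(f), p(f)) \le \frac{1}{\sqrt{N}} \left( E|f(\xi_1) - p(f)|^2 \right)^{1/2} \le \frac{2}{\sqrt{N}} \left( E[f(\xi_1)^2] \right)^{1/2}. \]

Moreover,

\[ w_1(\mathcal{L}_{p(f)|\xi(n)}, \mathcal{L}_{e_N(f)|\xi(n)}) \le \frac{n}{N} E\left[ \left| \frac{1}{n} \sum_{i=1}^{n} f(\xi_i) - p(f) \right| \,\Big|\, \xi(n) \right] + \frac{2 \sqrt{N - n}}{N} \left( E[f(\xi_{n+1})^2 \mid \xi(n)] \right)^{1/2} \quad \text{(a.e.)}. \]

Proof. Applying a well-known conditioning argument, note that

\[ w_1(e_N(f), p(f)) \le E|e_N(f) - p(f)| = E[E(|e_N(f) - p(f)| \mid p)] = E\left[ E\left( \left| \frac{1}{N} \sum_{i=1}^{N} f(\xi_i) - p(f) \right| \,\Big|\, p \right) \right] \]

(by the Cauchy-Schwarz inequality)

\[ \le \frac{1}{N} E\left[ \left( E\left( \left| \sum_{i=1}^{N} (f(\xi_i) - p(f)) \right|^2 \,\Big|\, p \right) \right)^{1/2} \right] = \frac{1}{\sqrt{N}} E\left[ \left( E\left( (f(\xi_1) - p(f))^2 \mid p \right) \right)^{1/2} \right] \]

(by the Jensen inequality)

\[ \le \frac{1}{\sqrt{N}} \left( E\left[ (f(\xi_1) - p(f))^2 \right] \right)^{1/2}. \]

Clearly $E[(f(\xi_1) - p(f))^2]^{1/2} \le (2(E[f(\xi_1)^2] + E[p(f)^2]))^{1/2}$ and, by the
Jensen inequality, $E[p(f)^2] \le E[p(f^2)] = E[f(\xi_1)^2]$. As for the second part of the
proposition, first note that

\[ w_1(\mathcal{L}_{p(f)|\xi(n)}, \mathcal{L}_{e_N(f)|\xi(n)}) \le \frac{N - n}{N} E\left[ \left| \frac{1}{N - n} \sum_{i=n+1}^{N} f(\xi_i) - p(f) \right| \,\Big|\, \xi(n) \right] + \frac{n}{N} E\left[ \left| \frac{1}{n} \sum_{i=1}^{n} f(\xi_i) - p(f) \right| \,\Big|\, \xi(n) \right]. \]

Now, take the conditional expectation given $(p, \xi(n))$ and use again the Cauchy-Schwarz
inequality to obtain

\[ E\left[ \left| \frac{1}{N - n} \sum_{i=n+1}^{N} f(\xi_i) - p(f) \right| \,\Big|\, \xi(n) \right] \le \frac{1}{\sqrt{N - n}} E\left[ E\left( |f(\xi_{n+1}) - p(f)|^2 \mid p, \xi(n) \right) \mid \xi(n) \right]^{1/2}. \]

Finally, to complete the proof, apply the Jensen inequality and argue as in the previous part
of the proof. □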
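
The key step of the proof, $E|e_N(f) - p(f)| \le N^{-1/2} (E|f(\xi_1) - p(f)|^2)^{1/2}$, can be checked exactly in a toy model. The sketch below assumes a hypothetical two-point prior: $p$ is a Bernoulli law whose success probability $\theta$ takes finitely many values with given weights, and $f(x) = x$, so that $p(f) = \theta$ and $E|f(\xi_1) - p(f)|^2 = E[\theta(1 - \theta)]$. The function names are illustrative only.

```python
from math import comb, sqrt


def mean_abs_deviation(theta, N):
    """Exact E|e_N(f) - theta| when the xi_i are i.i.d. Bernoulli(theta) and f(x) = x."""
    return sum(comb(N, s) * theta**s * (1 - theta)**(N - s) * abs(s / N - theta)
               for s in range(N + 1))


def prop32_sides(thetas, weights, N):
    """Return (E|e_N(f) - p(f)|, N^{-1/2} * (E|f(xi_1) - p(f)|^2)^{1/2}) for a discrete prior."""
    lhs = sum(w * mean_abs_deviation(t, N) for t, w in zip(thetas, weights))
    second_moment = sum(w * t * (1 - t) for t, w in zip(thetas, weights))
    return lhs, sqrt(second_moment / N)
```

Since $w_1(e_N(f), p(f)) \le E|e_N(f) - p(f)|$, the first returned value is itself an upper bound for the distance, and it never exceeds the second one.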

Of course, the mean is not the only interesting functional which appears in statistical
problems. For instance, statisticians frequently deal with functionals of the form

\[ t_1(p) = \int_{X^k} f(x_1, \dots, x_k) \, p(dx_1) \cdots p(dx_k), \]

or even of the form

\[ t_2(p) = \operatorname{argmin}_{\theta \in \Theta} \int_{X^k} f_\theta(x_1, \dots, x_k) \, p(dx_1) \cdots p(dx_k). \]

Think, for example, of the variance or of the median of a probability measure, respectively.
It is immediate to generalize Proposition 3.1 according to

Proposition 3.3. Given a measurable function $f : X^k \to \mathbb{R}$ such that

\[ P\left\{ \int_{X^k} |f(x_1, \dots, x_k)| \, p(dx_1) \cdots p(dx_k) < +\infty \right\} = 1, \]

then $t_1(e_N)$ converges in law to $t_1(p)$ (as $N \to +\infty$). Analogously,
$\mathcal{L}_{t_1(e_N)|\xi(n)}$ converges weakly (almost surely) to
$\mathcal{L}_{t_1(p)|\xi(n)}$.

As far as functionals of the type of $t_2$ are concerned, the situation is less simple. As a
general strategy, one could apply the usual argmax argument. See, e.g., [33]. To do this, set

\[ M_N(\theta) := \int_{X^k} f_\theta(x_1, \dots, x_k) \, e_N(dx_1) \cdots e_N(dx_k), \]

\[ M(\theta) := \int_{X^k} f_\theta(x_1, \dots, x_k) \, p(dx_1) \cdots p(dx_k) \]

and $\theta_N = t_2(e_N)$. Assume that $\Theta$ is a subset of $\mathbb{R}^d$ and, for every
$T \subset \mathbb{R}^d$, define the set $l^\infty(T)$ of all measurable functions
$f : T \to \mathbb{R}$ satisfying

\[ \|f\|_T := \sup_{t \in T} |f(t)| < +\infty. \]

A version of the argmax theorem (Theorem 3.2.2 in [33]) implies that: if $M_N$ converges in law
to $M$ in $l^\infty(K)$ for every compact set $K \subset \mathbb{R}^d$, if almost all sample
paths $\theta \mapsto M(\theta)$ are lower semi-continuous and possess a unique minimum at a
random point $\hat\theta = t_2(p)$, and if $(\theta_N)_{N \ge 1}$ is tight, then $\theta_N$
converges in law to $\hat\theta$.

As for the first hypothesis, that is $M_N$ converges in law to $M$ in $l^\infty(K)$ for every
compact set $K \subset \mathbb{R}^d$, one can resort to Theorems 1.5.4 and 1.5.6 in [33]. Such
theorems imply that if $(M_N(\theta_1), \dots, M_N(\theta_k))$ converges in law to
$(M(\theta_1), \dots, M(\theta_k))$ for every $k$ and every $(\theta_1, \dots, \theta_k)$ in
$K^k$ and if, for every $\epsilon$ and $\eta > 0$, there is a finite partition
$\{T_1, \dots, T_m\}$ of $K$ such that

\[ (1) \qquad \limsup_{N} P\left\{ \sup_{i} \sup_{h_1, h_2 \in T_i} |M_N(h_1) - M_N(h_2)| > \epsilon \right\} < \eta \]

then $M_N$ converges in law to $M$ in $l^\infty(K)$ for every compact set
$K \subset \mathbb{R}^d$. Hence, one can try to show that

\[ |f_{\theta_1}(x_1, \dots, x_k) - f_{\theta_2}(x_1, \dots, x_k)| \le g(\|\theta_1 - \theta_2\|_2) \, \phi(x_1, \dots, x_k) \]

for some continuous function $g$, with $g(0) = 0$, and some function $\phi$ such that for some
$\theta_0$

\[ P\left\{ \int_{X^k} [\phi(x_1, \dots, x_k) + |f_{\theta_0}(x_1, \dots, x_k)|] \, p(dx_1) \cdots p(dx_k) < +\infty \right\} = 1. \]

If these conditions hold, then the convergence of $M_N$ to $M$ is easily proved, whereas both
tightness of $(\theta_N)_{N \ge 1}$ and uniqueness of $\hat\theta$ require additional
assumptions.

Here is an example, where $\mathrm{med}(p)$ denotes the median of the distribution $p$.

Proposition 3.4. Let $M_N = \mathrm{med}(e_{2N+1})$, that is $M_N = \xi_{(N+1)}$, where
$\xi_{(1)} \le \cdots \le \xi_{(2N+1)}$ denote the order statistics of
$(\xi_1, \dots, \xi_{2N+1})$. If

\[ P\left\{ \int_{\mathbb{R}} |x| \, p(dx) < +\infty \right\} = 1 \quad \text{and} \quad P\{\mathrm{med}(p) \text{ is unique}\} = 1, \]

then $M_N$ converges in law to $\mathrm{med}(p)$ as $N$ diverges. Analogously, if, for some
$n < N$,

\[ P\left\{ \int_{\mathbb{R}} |x| \, p(dx) < +\infty \,\Big|\, \xi(n) \right\} = 1 \quad \text{(a.e.)} \]

and

\[ P\{\mathrm{med}(p) \text{ is unique} \mid \xi(n)\} = 1 \quad \text{(a.e.)}, \]

then $\mathcal{L}_{M_N|\xi(n)}$ converges weakly (almost surely) to
$\mathcal{L}_{\mathrm{med}(p)|\xi(n)}$ as $N$ diverges.

Proof. In this case

\[ M_N(\theta) = \int_{\mathbb{R}} |x - \theta| \, de_{2N+1} \]

and

\[ M(\theta) = \int_{\mathbb{R}} |x - \theta| \, dp. \]

Since $P\{\int_{\mathbb{R}} |x| p(dx) < +\infty\} = 1$, from Proposition 3.1 we get that
$(M_N(\theta_1), \dots, M_N(\theta_k))$ converges in law to $(M(\theta_1), \dots, M(\theta_k))$
for every $k$ and every $(\theta_1, \dots, \theta_k)$. Moreover

\[ |M_N(\theta_1) - M_N(\theta_2)| \le |\theta_1 - \theta_2|, \]

hence (1) is verified. It remains to prove the tightness of $(M_N)_{N \ge 1}$. First of all
observe that if $X_1, \dots, X_{2N+1}$ are i.i.d. random variables with common distribution
function $F$ then the distribution function of the median of $X_1, \dots, X_{2N+1}$ is given by

\[ \sum_{k=N+1}^{2N+1} \binom{2N+1}{k} F^k(x) (1 - F(x))^{2N+1-k} = \frac{1}{B(N+1, N+1)} \int_0^{F(x)} t^N (1 - t)^N dt, \]

where $B$ is the Euler integral of the first kind (the so-called beta function). Hence,
denoting by $F_p(x)$ the distribution function of $p$ and setting

\[ H_x(t) = P\{F_p(x) \le t\}, \]

it follows that

\[ P\{M_N \le x\} = E\left[ B(N+1, N+1)^{-1} \int_0^{F_p(x)} t^N (1 - t)^N dt \right] = B(N+1, N+1)^{-1} \int_0^1 \int_0^{\tau} t^N (1 - t)^N dt \, dH_x(\tau) \]

\[ = B(N+1, N+1)^{-1} \int_0^1 t^N (1 - t)^N [1 - H_x(t)] \, dt. \]

Now, by the Markov inequality,

\[ [1 - H_x(t)] = P\{F_p(x) > t\} \le \frac{1}{t} E[F_p(x)] = \frac{1}{t} P\{\xi_1 \le x\}, \]

hence,

\[ P\{M_N \le x\} \le P\{\xi_1 \le x\} \, B(N+1, N+1)^{-1} \int_0^1 t^{N-1} (1 - t)^N dt = P\{\xi_1 \le x\} \frac{2N+1}{N}. \]

In the same way, it is easy to see that

\[ P\{M_N > x\} = 1 - P\{M_N \le x\} = 1 - B(N+1, N+1)^{-1} \int_0^1 t^N (1 - t)^N [1 - H_x(t)] \, dt \]

\[ = B(N+1, N+1)^{-1} \int_0^1 t^N (1 - t)^N H_x(t) \, dt = B(N+1, N+1)^{-1} \int_0^1 t^N (1 - t)^N H_x(1 - t) \, dt \]

and hence

\[ P\{M_N > x\} \le \frac{2N+1}{N} P\{\xi_1 > x\}. \]

With these inequalities it is immediate to prove the tightness of $(M_N)_{N \ge 1}$. The proof
of the second part of the proposition is analogous. □
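
The identity between the binomial tail and the incomplete beta integral used in the proof, as well as the constant $(2N+1)/N = B(N, N+1)/B(N+1, N+1)$, can be verified in exact rational arithmetic. The following is a small sketch with hypothetical function names; it expects the value $F(x)$ to be passed as a `Fraction` so that no rounding occurs.

```python
from fractions import Fraction
from math import comb, factorial


def median_cdf_binomial(F, N):
    """P{median <= x}: at least N+1 of the 2N+1 i.i.d. draws fall at or below x."""
    m = 2 * N + 1
    return sum(comb(m, k) * F**k * (1 - F)**(m - k) for k in range(N + 1, m + 1))


def beta_fn(a, b):
    """Euler beta function B(a, b) for positive integers a and b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def median_cdf_beta(F, N):
    """Same probability written as B(N+1, N+1)^{-1} * integral_0^F t^N (1 - t)^N dt."""
    # Expand (1 - t)^N with the binomial theorem and integrate term by term.
    integral = sum(comb(N, j) * (-1)**j * F**(N + j + 1) / (N + j + 1)
                   for j in range(N + 1))
    return integral / beta_fn(N + 1, N + 1)
```

For instance, with `F = Fraction(1, 3)` and `N = 4` the two functions return the same rational number, and `beta_fn(4, 5) / beta_fn(5, 5)` equals `Fraction(9, 4)`.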

4. Comparing posterior distributions of random probabilities 

We now turn our attention to the comparison of $\mathcal{L}_{e_N|\xi(n)}$ with
$\mathcal{L}_{p|\xi(n)}$. We shall use the Gini-Kantorovich-Wasserstein distance on the space
of all probability measures. The Gini-Kantorovich-Wasserstein distance of order 1 (relative to
a metric $\lambda$) between two probability measures, say $(Q_1, Q_2)$, defined on
$(\mathbb{P}, \mathcal{P})$ is

\[ W_1(Q_1, Q_2) := \inf \left\{ \int_{\mathbb{P}^2} \lambda(p_1, p_2) \, \Gamma(dp_1 \, dp_2) : \Gamma \in \mathcal{M}(Q_1, Q_2) \right\} \]

where $\mathcal{M}(Q_1, Q_2)$ is the set of all probability measures on
$(\mathbb{P} \times \mathbb{P}, \mathcal{P} \otimes \mathcal{P})$ with marginals $Q_1$ and
$Q_2$. Here, it is worth recalling that $W_1$ admits the following dual representation

\[ (2) \qquad W_1(Q_1, Q_2) = \sup \left\{ \int_{\mathbb{P}} f(p) (Q_1(dp) - Q_2(dp)) \; ; \; f : \mathbb{P} \to \mathbb{R}, \; |f(p) - f(q)| \le \lambda(p, q) \;\; \forall p, q \in \mathbb{P} \right\}. \]

See, e.g., Theorem 11.8.2 in [9]. The main goal of this section is to give explicit upper
bounds for the random variable $W_1(\mathcal{L}_{e_N|\xi(n)}, \mathcal{L}_{p|\xi(n)})$.

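4.1. A first bound for the posterior distributions. There is a large body of literature
on the rate of convergence to zero (when $N$ diverges) of

\[ E_N(p) := E[\lambda(p, e_N^{(p)})] \]

where $e_N^{(p)} := \sum_{i=1}^{N} \delta_{z_i^{(p)}} / N$ and $(z_i^{(p)})_{i \ge 1}$ is a
sequence of independent and identically distributed (i.i.d.) random variables taking values in
$X$, with common probability measure $p$. See, for instance, [1], [?] and [?]. The next lemma
shows how these well-known results can be used to get a bound for
$W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)})$.

Lemma 4.1. Assume that $\lambda$ is bounded and satisfies

\[ (3) \qquad \lambda(p, \epsilon p_1 + (1 - \epsilon) p_2) \le \epsilon \lambda(p, p_1) + (1 - \epsilon) \lambda(p, p_2) \]

for every $\epsilon$ in $(0, 1)$ and every $p, p_1, p_2$ in $\mathbb{P}$. Moreover, let
$K := \sup\{\lambda(p, q) : (p, q) \in \mathbb{P}^2\}$. Then

\[ (4) \qquad W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)}) \le \int_{\mathbb{P}} E_{N-n}(p) \, \mathcal{L}_{p|\xi(n)}(dp) + \frac{nK}{N} \]

holds true for $P$-almost every $\xi(n)$.

Proof. First of all, note that for every $A$ in $\mathcal{P}$

\[ \mathcal{L}_{e_N|\xi(n)}(A) = \int_{\mathbb{P}} \mathcal{L}_{e_N|\xi(n),p}(A) \, \mathcal{L}_{p|\xi(n)}(dp), \]

where, according to our notation, $\mathcal{L}_{e_N|\xi(n),p}$ denotes (a version of) the
conditional distribution of $e_N$ given $(\xi(n), p)$. Hence, from the dual representation (2)
of $W_1$ it is easy to see that

\[ W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)}) \le \int_{\mathbb{P}} W_1(\delta_p, \mathcal{L}_{e_N|\xi(n),p}) \, \mathcal{L}_{p|\xi(n)}(dp). \]

Now, write

\[ e_N = \frac{n}{N} e_n + \frac{N - n}{N} e_{N,n} \]

with $e_{N,n} = \sum_{i=n+1}^{N} \delta_{\xi_i} / (N - n)$, and observe that $e_n$ and $e_{N,n}$
are conditionally independent given $p$. Moreover, $e_{N,n}$ has the same law as $e_{N-n}$ and
$W_1(\delta_p, Q) = \int_{\mathbb{P}} \lambda(p, q) Q(dq)$. Hence,

\[ W_1(\delta_p, \mathcal{L}_{e_N|\xi(n),p}) = \int_{\mathbb{P}} \lambda(p, q) \, \mathcal{L}_{e_N|\xi(n),p}(dq) = E\left[ \lambda\left( p, \frac{n}{N} e_n + \frac{N - n}{N} e_{N,n} \right) \,\Big|\, \xi(n), p \right] \]

\[ \le \frac{N - n}{N} E[\lambda(p, e_{N,n}) \mid p] + \frac{nK}{N} = \frac{N - n}{N} E_{N-n}(p) + \frac{nK}{N}. \]

The thesis follows from integration over $\mathbb{P}$ with respect to
$\mathcal{L}_{p|\xi(n)}$. □

In the next three subsections we shall use the previous lemma with different choices
of $X$ and $\lambda$.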



4.2. The finite case. We start from the simple case in which $X = \{a_1, \dots, a_k\}$. Here
$\mathbb{P}$ can be seen as the simplex

\[ S_k = \left\{ x \in \mathbb{R}^k : 0 \le x_i \le 1, \; i = 1, \dots, k, \; \sum_{i=1}^{k} x_i = 1 \right\}. \]

Define $\lambda$ to be the total variation distance, i.e.
$\lambda(p, q) = \frac{1}{2} \sum_{i=1}^{k} |p(a_i) - q(a_i)|$. In point of fact, it should be
noted that, since $X$ is finite, there is no difference between the strong and the weak
topology on $\mathbb{P}$. In this case, for every $j = 1, \dots, k$, one has

\[ e_N(a_j) = \#\{i : \xi_i = a_j, \; 1 \le i \le N\} / N. \]

Now, denoting by $Z_i$ a binomial random variable of parameters $(N - n, p_i)$
($p_i := p(a_i)$), we get

\[ E_{N-n}(p) = \frac{1}{2(N - n)} \sum_{i=1}^{k} E|Z_i - (N - n) p_i| \le \frac{1}{2(N - n)} \sum_{i=1}^{k} \sqrt{(N - n) p_i (1 - p_i)} = \frac{1}{2 \sqrt{N - n}} \sum_{i=1}^{k} \sqrt{p_i (1 - p_i)} \le \frac{k}{4 \sqrt{N - n}}. \]

Observing that $K := \sup\{\lambda(p, q) : p, q \in S_k\} \le 1$ and that the total variation
distance satisfies (3), Lemma 4.1 gives

Proposition 4.2. If $X = \{a_1, \dots, a_k\}$, then

\[ W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)}) \le \frac{k}{4 \sqrt{N - n}} + \frac{n}{N}. \]
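
The binomial estimate behind the finite case can be checked exactly for a fixed $p$. The sketch below computes $E_m(p)$ (with $m = N - n$) from the binomial probability mass function and compares it with the two successive upper bounds obtained from $E|Z - mp| \le \sqrt{mp(1 - p)}$ and $\sqrt{p(1 - p)} \le 1/2$; the function names are illustrative, and the final bound $k/(4\sqrt{N - n}) + n/N$ is the form that follows from Lemma 4.1 with $K \le 1$.

```python
from math import comb, sqrt


def expected_abs_dev(m, p):
    """Exact E|Z - m*p| for Z ~ Binomial(m, p)."""
    return sum(comb(m, z) * p**z * (1 - p)**(m - z) * abs(z - m * p)
               for z in range(m + 1))


def expected_tv(probs, m):
    """E_m(p): expected total variation between p and the empirical law of m i.i.d. draws."""
    return sum(expected_abs_dev(m, q) for q in probs) / (2 * m)


def finite_case_bound(k, n, N):
    """Upper bound k / (4 * sqrt(N - n)) + n / N for a sample space with k points."""
    return k / (4 * sqrt(N - n)) + n / N
```

A call such as `expected_tv([0.5, 0.3, 0.2], 40)` stays below `3 / (4 * sqrt(40))`, in agreement with the chain of inequalities above.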
4.3. The case $X = \mathbb{R}$. Passing to a general Euclidean space we first need to choose a
suitable metric $\lambda$. We recall that if $p$ and $q$ belong to $\mathbb{P}(\mathbb{R}^d)$,
the so-called bounded Lipschitz distance (denoted by $\beta$) between $p$ and $q$ is defined by

\[ \beta(p, q) = \sup \left\{ \int_{\mathbb{R}^d} f(x) [p(dx) - q(dx)] \; ; \; f : \mathbb{R}^d \to \mathbb{R}, \; \|f\|_{BL} \le 1 \right\} \]

where $\|f\|_{BL} := \sup_{x \in \mathbb{R}^d} |f(x)| + \sup_{x \ne y} |f(x) - f(y)| / \|x - y\|$.
See Section 11.3 in [9].

Note that $\sup_{(p,q) \in \mathbb{P}^2} \beta(p, q) \le 2$ and that $\beta$ satisfies
$\beta(p, \epsilon p_1 + (1 - \epsilon) p_2) \le \epsilon \beta(p, p_1) + (1 - \epsilon) \beta(p, p_2)$
for every $\epsilon$ in $(0, 1)$ and every $p, p_1, p_2$ in $\mathbb{P}$. Recall also that
$\beta$ metrizes the weak topology (see, e.g., Theorem 11.3.3 in [9]). In what follows we take
$X = \mathbb{R}$ and $\lambda = \beta$. As a consequence of Lemma 4.1 we get the next
proposition in which, for every $p$ in $\mathbb{P}$, we set $F_p(x) = p\{(-\infty, x]\}$.

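Proposition 4.3. Let $X = \mathbb{R}$ and $\lambda = \beta$. Set
$\Lambda(p) := \int_{\mathbb{R}} \sqrt{F_p(t)(1 - F_p(t))} \, dt$. If
$E[\Lambda(p)] < +\infty$, then the inequalities

\[ W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)}) \le \frac{1}{\sqrt{N - n}} E[\Lambda(p) \mid \xi(n)] + \frac{2n}{N} \le \frac{Y}{\sqrt{N - n}} + \frac{2n}{N} \]

hold true for all $n < N$, with $Y := \sup_n E[\Lambda(p) \mid \xi(n)] < +\infty$, for
$P$-almost every $\xi(n)$.

Proof. As already recalled, $\sup_{(p,q) \in \mathbb{P}^2} \beta(p, q) \le 2$ and $\beta$
satisfies (3). Using the dual representation of $w_1$ (which is the analogue of (2) with
$\mathbb{R}^d$ in the place of $\mathbb{P}$ and $\| \cdot \|$ in the place of $\lambda$) it is
easy to see that

\[ (5) \qquad \beta(p, q) \le w_1(p, q). \]

Moreover, recall that, when $X = \mathbb{R}$,

\[ (6) \qquad w_1(p, q) = \int_{\mathbb{R}} |F_p(x) - F_q(x)| \, dx. \]

See, for instance, [30]. For any $p$ in $\mathbb{P}_1$, for the sake of simplicity, set
$z_i = z_i^{(p)}$ and observe that combination of (5) and (6) gives

\[ E_{N-n}(p) \le E\left[ \int_{\mathbb{R}} \left| F_p(t) - \frac{1}{N - n} \sum_{i=1}^{N-n} \mathbb{I}(z_i \le t) \right| dt \right] = \frac{1}{N - n} E\left[ \int_{\mathbb{R}} \left| (N - n) F_p(t) - \sum_{i=1}^{N-n} \mathbb{I}(z_i \le t) \right| dt \right] \]

(by the Fubini theorem)

\[ = \frac{1}{N - n} \int_{\mathbb{R}} E\left| (N - n) F_p(t) - \sum_{i=1}^{N-n} \mathbb{I}(z_i \le t) \right| dt. \]

Now, note that $\sum_{i=1}^{N-n} \mathbb{I}(z_i \le t)$ is a binomial random variable of
parameters $(N - n, F_p(t))$. Hence, since

\[ (7) \qquad \int_{\mathbb{R}} \sqrt{F_p(t)(1 - F_p(t))} \, dt < +\infty \]

holds true $P$-almost surely, from the Cauchy-Schwarz inequality one gets

\[ E_{N-n}(p) \le \frac{1}{\sqrt{N - n}} \int_{\mathbb{R}} \sqrt{F_p(t)(1 - F_p(t))} \, dt. \]

Combination of this fact with Lemma 4.1 and the obvious identity
$\int_{\mathbb{P}} \Lambda(p) \, \mathcal{L}_{p|\xi(n)}(dp) = E[\Lambda(p) \mid \xi(n)]$ gives
the first part of the thesis. To conclude the proof, apply Doob's martingale convergence
theorem (see, e.g., Theorem 10.5.1 in [?]) to $E[\Lambda(p) \mid \xi(n)]$ in order to prove
that $\sup_n E[\Lambda(p) \mid \xi(n)] < +\infty$ almost surely. □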
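
For a discrete probability measure on the real line the quantity $\Lambda(p)$ is a finite sum, because $F_p$ is piecewise constant. The sketch below evaluates it and the resulting right-hand side $E[\Lambda(p) \mid \xi(n)] / \sqrt{N - n} + 2n/N$; the function names are hypothetical, and the posterior mean of $\Lambda(p)$ is simply passed in as a number, since it depends on a prior that is not specified here.

```python
from math import sqrt


def lambda_functional(atoms, weights):
    """Lambda(p) = integral of sqrt(F_p(t) * (1 - F_p(t))) dt for a discrete p on the real line."""
    pairs = sorted(zip(atoms, weights))
    total, cdf = 0.0, 0.0
    for (x, w), (x_next, _) in zip(pairs, pairs[1:]):
        cdf += w  # F_p is constant and equal to cdf on [x, x_next)
        total += (x_next - x) * sqrt(max(0.0, cdf * (1.0 - cdf)))
    return total


def prop43_bound(post_mean_lambda, n, N):
    """E[Lambda(p) | xi(n)] / sqrt(N - n) + 2n / N, with the posterior mean supplied by the caller."""
    return post_mean_lambda / sqrt(N - n) + 2 * n / N
```

For the uniform law on $\{0, 1\}$, `lambda_functional([0.0, 1.0], [0.5, 0.5])` equals `0.5`. More generally, since $\sqrt{F(1 - F)} \le 1/2$, a law supported in $[-M, M]$ has $\Lambda(p) \le M$.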

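A first simple consequence of the previous proposition is embodied in

Corollary 1. Let $X = [-M, M]$ for some $0 < M < +\infty$. Then,

\[ W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)}) \le \frac{2M}{\sqrt{N - n}} + \frac{2n}{N} \]

holds true for all $n < N$ for $P$-almost every $\xi(n)$.

It is worth recalling that $\Lambda(p) < +\infty$ implies finite second moment for $p$ but not
conversely (this condition defines the Banach space $L_{2,1}$, cf. [22], p. 10). It is easy to
show that if $p$ has finite moment of order $2 + \delta$, for some positive $\delta$, then

\[ (8) \qquad \Lambda(p) \le 1 + C_\delta \left[ \int_{\mathbb{R}} |x|^{2+\delta} p(dx) \right]^{1/2} \]

holds true with $C_\delta := \sqrt{2(1 + \delta)/\delta}$. As a consequence of these
statements we have the following

Corollary 2. If $E[\Lambda(p)] < +\infty$, then

\[ E[W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)})] \le \frac{E[\Lambda(p)]}{\sqrt{N - n}} + \frac{2n}{N} \]

and

\[ P\{W_1(\mathcal{L}_{e_N|\xi(n)}, \mathcal{L}_{p|\xi(n)}) > \epsilon\} \le \frac{1}{\epsilon} \left[ \frac{E[\Lambda(p)]}{\sqrt{N - n}} + \frac{2n}{N} \right] \]

hold true for all $n < N$. Moreover, if $E|\xi_1|^{2+\delta} < +\infty$ for some positive
$\delta$, then

\[ E[\Lambda(p)] \le 1 + \left[ \frac{2(1 + \delta)}{\delta} E|\xi_1|^{2+\delta} \right]^{1/2}. \]

Proof. By Proposition 4.3, whenever $E[\Lambda(p)] < +\infty$, one can write

\[ E[W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)})] \le \frac{1}{\sqrt{N - n}} E[E[\Lambda(p) \mid \xi(n)]] + \frac{2n}{N} = \frac{E[\Lambda(p)]}{\sqrt{N - n}} + \frac{2n}{N}. \]

Now, let $\bar p(\cdot) = E(p(\cdot))$. Then, (8) together with the Fubini theorem and the
Jensen inequality yield

\[ E[\Lambda(p)] \le 1 + C_\delta \left[ \int_{\mathbb{R}} |x|^{2+\delta} \bar p(dx) \right]^{1/2}. \]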



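Combining these facts with the Markov inequality completes the proof. □

4.4. The case $X = \mathbb{R}^d$. Let $X = \mathbb{R}^d$ and $\lambda = \beta$. For any $p$ in
$\mathbb{P}$ and $k$ in $\mathbb{N}$ consider

\[ \Psi_k(p) := \sup_{\epsilon \in (0, 1]} \epsilon^k N(\epsilon, \epsilon^{k/(k-2)}, p) \]

where $N(\epsilon, \eta, p)$ is the minimal number of sets of diameter $\le 2\epsilon$ which
cover $\mathbb{R}^d$ except for a set $A$ with $p(A) \le \eta$. Proposition 3.1 in [?] (see
also Theorem 7 in [3]) gives

\[ E[\beta(p, e_{N-n}^{(p)})] \le (N - n)^{-1/k} \left[ \frac{4}{3} + 4 \cdot 3^{2k} \Psi_k(p) \right]. \]

Using the last inequality and arguing as in the proof of Proposition 4.3 we obtain the
following

Proposition 4.4. If $E[\Psi_k(p)] < +\infty$ for some positive $k$, then the inequality

\[ W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)}) \le \frac{1}{(N - n)^{1/k}} \left( \frac{4}{3} + 4 \cdot 3^{2k} E[\Psi_k(p) \mid \xi(n)] \right) + \frac{2n}{N} \le \frac{1}{(N - n)^{1/k}} \left( \frac{4}{3} + 4 \cdot 3^{2k} Y \right) + \frac{2n}{N} \]

holds true for all $n < N$, with $Y := \sup_n E[\Psi_k(p) \mid \xi(n)] < +\infty$, for
$P$-almost every $\xi(n)$.

Remark 1. In the last proposition the fact that $X = \mathbb{R}^d$ does not play any special
role. Everything remains true if $X$ is a Polish space.

Condition $E[\Psi_k(p)] < +\infty$ is almost impossible to check. In what follows we will
assume a more tractable hypothesis. If $\int_{\mathbb{R}^d} \|x\|^\gamma p(dx) < +\infty$ where
$\gamma = \frac{kd}{(k - d)(k - 2)}$, $d \ge 2$ and $k > d$, Proposition 3.4 in [?] (see also
Theorem 8 in [3]) yields

\[ (9) \qquad \Psi_k(p) \le 2^{d/2} \left[ 1 + 2 \left( \int_{\mathbb{R}^d} \|x\|^\gamma p(dx) \right)^{1/\gamma} \right]^{1/2}. \]

Using this last inequality we can prove the following

Proposition 4.5. Let $d \ge 2$, $k > d$ and set $\gamma := \frac{kd}{(k - d)(k - 2)}$. Assume
that $E\|\xi_1\|^\gamma$ is finite and that $\gamma > 1$. If
$Y_n := 2 (E[\int_{\mathbb{R}^d} \|x\|^\gamma p(dx) \mid \xi(n)])^{1/\gamma}$, then, for all
$n < N$ and for $P$-almost every $\xi(n)$, one gets $Y := \sup_n Y_n < +\infty$ and

\[ W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)}) \le \frac{1}{(N - n)^{1/k}} \left[ \frac{4}{3} + 4 \cdot 3^{2k} \cdot 2^{d/2} (1 + Y_n)^{1/2} \right] + \frac{2n}{N}. \]

Moreover,

\[ P\{W_1(\mathcal{L}_{p|\xi(n)}, \mathcal{L}_{e_N|\xi(n)}) > \epsilon\} \le \frac{1}{\epsilon} \left[ \frac{K}{(N - n)^{1/k}} + \frac{2n}{N} \right] \]

holds true for all $n < N$ with

\[ K = \frac{4}{3} + 4 \cdot 3^{2k} \cdot 2^{d/2} \left( 1 + 2 (E\|\xi_1\|^\gamma)^{1/\gamma} \right)^{1/2}. \]

Proof. Using (9) and applying the Jensen inequality two times, we obtain

\[ E[\Psi_k(p) \mid \xi(n)] \le \left\{ 2^d + 2^{d+1} \left( E\left[ \int_{\mathbb{R}^d} \|x\|^\gamma p(dx) \,\Big|\, \xi(n) \right] \right)^{1/\gamma} \right\}^{1/2}. \]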



Combining Lemma 4.1 with this last inequality, Doob's martingale convergence theorem, the
Markov inequality and the Jensen inequality concludes the proof. □

4.5. Examples. The application of the theorems of this section essentially requires
conditions on the moments of $\xi_1$. In the most common cases, the marginal distribution of
each observation is available. Indeed, from a Bayesian point of view, the marginal
distribution of each observation is usually treated as a prior guess of the mean of the
unknown $p$. In the next three examples we review a few classical Bayesian nonparametric
priors from this perspective.

Example 5 (Normalized random measures with independent increments). Probably the most
celebrated example of nonparametric priors is the Dirichlet process, see, for example, [?].
A class of nonparametric priors which includes and generalizes the Dirichlet process is the
class of the so-called normalized random measures with independent increments, introduced
in [31] and studied, e.g., in [26], [23], [24], [17], [?]. To define a normalized random
measure with independent increments it is worth recalling that a random measure $\tilde\mu$
with independent increments on $\mathbb{R}^d$ is a random measure such that, for any
measurable collection $\{A_1, \dots, A_k\}$ ($k \ge 1$) of pairwise disjoint measurable subsets
of $\mathbb{R}^d$, the random variables $\tilde\mu(A_1), \dots, \tilde\mu(A_k)$ are
stochastically independent. Random measures with independent increments are completely
characterized by a measure $\nu$ on $\mathbb{R}^d \times \mathbb{R}^+$ via their Laplace
functional. More precisely, for every $A$ in $\mathcal{B}(\mathbb{R}^d)$ and every positive
$\lambda$ one has

\[ E(e^{-\lambda \tilde\mu(A)}) = \exp\left\{ -\int_{A \times \mathbb{R}^+} (1 - e^{-\lambda v}) \, \nu(dx \, dv) \right\}. \]

A systematic account of these random measures is given, for example, in [?]. Following [31],
if $\int_{\mathbb{R}^d \times \mathbb{R}^+} (1 - e^{-\lambda v}) \, \nu(dx \, dv) < +\infty$ for
every positive $\lambda$ and $\nu(\mathbb{R}^d \times \mathbb{R}^+) = +\infty$, then one defines
a normalized random measure with independent increments putting
$p(\cdot) := \tilde\mu(\cdot) / \tilde\mu(\mathbb{R}^d)$. In point of fact, under the previous
assumptions, $P\{\tilde\mu(\mathbb{R}^d) = 0\} = 0$; see [31]. The classical example is the
Dirichlet process, obtained with
$\nu(dx \, dv) = \alpha(dx) \rho(dv) = \alpha(dx) v^{-1} e^{-v} dv$, $\alpha$ being a finite
measure on $\mathbb{R}^d$. Consider now a sequence $(\xi_i)_{i \ge 1}$ of exchangeable random
variables driven by $p$. When $\nu(dx \, dv) = \alpha(dx) \rho(dv)$, then
$P\{\xi_i \in A\} = \alpha(A) / \alpha(\mathbb{R}^d)$ for every $i \ge 1$. More generally,

\[ P\{\xi_i \in A\} = \int_{\mathbb{R}^+} \phi(\lambda) \int_{A \times \mathbb{R}^+} e^{-\lambda u} u \, \nu(dx \, du) \, d\lambda, \]

where

\[ \phi(\lambda) := \exp\left\{ -\int_{\mathbb{R}^d \times \mathbb{R}^+} (1 - e^{-\lambda v}) \, \nu(dy \, dv) \right\}; \]

see, e.g., Corollary 5.1 in [32]. Hence, $E\|\xi_i\|^m < +\infty$ if and only if

\[ \int_{\mathbb{R}^+} \phi(\lambda) \int_{\mathbb{R}^d \times \mathbb{R}^+} e^{-\lambda u} \|x\|^m u \, \nu(dx \, du) \, d\lambda < +\infty. \]

Example 6 (Species sampling sequences and stick-breaking priors). An exchangeable
sequence of random variables $(\xi_n)_n$ is called a species sampling sequence (see [28]) if,
for each $n \ge 1$,

\[ P\{\xi_{n+1} \in A \mid \xi(n)\} = l_{0,n} \, \alpha(A) + \sum_{j=1}^{k(n)} l_{j,n} \, \delta_{\xi_j^*}(A) \qquad (A \in \mathcal{X}) \]

and

\[ P\{\xi_1 \in A\} = \alpha(A) \]

with the proviso that $\xi_1^*, \dots, \xi_{k(n)}^*$ are the $k(n)$ distinct values of
$\xi_1, \dots, \xi_n$ in the same order as they appear, $l_{j,n}$ ($j = 0, \dots, k(n)$) are
non-negative measurable functions of $(\xi_1, \dots, \xi_n)$, and $\alpha$ is some non-atomic
probability measure on $(X, \mathcal{X})$. See, among others, [14], [13], [27], [29]. Of
course, in this case, it is a simple task to check conditions on the marginal distribution of
each observation, since it coincides with $\alpha$. A particular kind of random probability
laws connected with the species sampling sequences are the so-called stick-breaking priors.
Such priors are almost surely discrete random probability measures that can be represented as

\[ p(\cdot) = \sum_{k=1}^{N} p_k \, \delta_{Z_k}(\cdot) \]

where $(p_k)_{k \ge 1}$ and $(Z_k)_{k \ge 1}$ are independent, $0 \le p_k \le 1$ and
$\sum_{k=1}^{N} p_k = 1$ almost surely, and $(Z_k)_{k \ge 1}$ are independent and identically
distributed random variables taking values in $X$ with common probability $\alpha$.
Stick-breaking priors can be constructed using either a finite or an infinite number of terms,
$1 \le N \le +\infty$. Usually,

\[ p_1 = V_1, \qquad p_k = (1 - V_1)(1 - V_2) \cdots (1 - V_{k-1}) V_k, \quad k \ge 2, \]

where the $V_k$ are independent Beta$(a_k, b_k)$ random variables for $a_k > 0$, $b_k > 0$.
See [?]. It is clear that in this case

\[ P\{\xi_k \in A\} = \alpha(A). \]
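
The stick-breaking recursion for the weights is easy to state in code. The following sketch maps a given list of stick proportions $V_1, \dots, V_N$ to the weights $p_1, \dots, p_N$; the function name is illustrative, and the remark that a finite construction sums to one when the last proportion equals 1 is an assumption about how the finite case is closed, not something stated in the text above.

```python
def stick_breaking_weights(vs):
    """Map proportions V_1..V_N to p_1 = V_1, p_k = (1 - V_1)...(1 - V_{k-1}) * V_k."""
    weights, remaining = [], 1.0
    for v in vs:
        weights.append(remaining * v)  # break off a fraction v of what is left
        remaining *= 1.0 - v           # length of the stick still unbroken
    return weights
```

For example, `stick_breaking_weights([0.5, 0.5, 1.0])` gives `[0.5, 0.25, 0.25]`. In a simulation the proportions would be drawn as independent Beta variables, e.g. with `random.betavariate(a_k, b_k)` from the Python standard library.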

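Example 7 (Polya tree). Let $X = \mathbb{R}$ and let $E_j$ be the set of all sequences of 0s and
1s of length $j$. Moreover, set $E^* = \cup_j E_j$. For each $n$, let
$T_n = \{B_\epsilon : \epsilon \in E_n\}$ be a partition of $\mathbb{R}$ such that for all
$\epsilon$ in $E^*$, $B_{\epsilon 0}, B_{\epsilon 1}$ is a partition of $B_\epsilon$. Finally
let $\mathcal{A} = \{\alpha_\epsilon : \epsilon \in E^*\}$ be a set of nonnegative real numbers.
A random probability $p$ on $\mathbb{R}$ is said to be a Polya tree with respect to the
partition $T = \{T_n\}_n$ with parameter $\mathcal{A}$ if

• $\{p(B_{\epsilon 0} \mid B_\epsilon) : \epsilon \in E^*\}$ are a set of independent random
variables

• for all $\epsilon$ in $E^*$, $p(B_{\epsilon 0} \mid B_\epsilon)$ is
Beta$(\alpha_{\epsilon 0}, \alpha_{\epsilon 1})$.

See [?]. Under suitable conditions on $\mathcal{A}$, such a random probability does exist. See
Theorem 3.3.2 in [12]. Moreover, if $(\xi_n)_{n \ge 1}$ is an exchangeable sequence with
driving measure $p$, for any $B_\epsilon$ with $\epsilon = \epsilon_1 \epsilon_2 \dots \epsilon_m$

\[ P\{\xi_n \in B_\epsilon\} = \prod_{i=1}^{m} \frac{\alpha_{\epsilon_1 \epsilon_2 \dots \epsilon_i}}{\alpha_{\epsilon_1 \epsilon_2 \dots \epsilon_{i-1} 0} + \alpha_{\epsilon_1 \epsilon_2 \dots \epsilon_{i-1} 1}}. \]

See, e.g., Theorem 3.3.3 in [12]. In this case it is a difficult task to give explicit
conditions for the existence of the moments of $\xi_i$. Nevertheless, Lavine suggests that, if
the partitions have the form
$F^{-1}\left( \left( \sum_{i=1}^{m} \epsilon_i / 2^i, \; \sum_{i=1}^{m} \epsilon_i / 2^i + 1/2^m \right] \right)$,
$F$ being a continuous distribution function, and

\[ \alpha_{\epsilon 0} = \alpha_{\epsilon 1} \quad \text{for every } \epsilon \text{ in } E^*, \]

then $P\{\xi_n \le x\} = F(x)$.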